{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1ea7ad9f",
   "metadata": {},
   "source": [
    "(evaluating-llms)="
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ac4f331-7ce4-4b99-80ad-2b5607a42c4f",
   "metadata": {},
   "source": [
    "# Evaluating LLMs with MLRun\n",
    "\n",
    "This example guides you through setup, creating an evaluation function, running an evaluation job, and viewing the logged output."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ce305d8-0cf4-4426-a314-9c11dd217a8f",
   "metadata": {},
   "source": [
    "Evaluating large language models (LLMs) is crucial throughout the ML lifecycle. During development, thorough evaluation enables users to refine prompts, select models, and tune hyperparameters. During production, real-time evaluation and guardrails ensure that responses from LLMs are reliable, consistent, and relevant in real-world applications.\n",
    "\n",
    "**In this section**\n",
    "- [Challenges in evaluating LLMs](#challenges-in-evaluating-llms)\n",
    "- [Metrics overview](#metrics-overview)\n",
    "- [Prerequisite](#prerequisite)\n",
    "- [Setup](#setup)\n",
    "- [Select OpenAI or Qwen Model](#select-openai-or-qwen-model)\n",
    "- [Example evaluation task](#example-evaluation-task)\n",
    "- [Create an evaluation function](#create-an-evaluation-function)\n",
    "- [Run an evaluation job](#run-an-evaluation-job)\n",
    "- [View the logged output](#view-the-logged-output)\n",
    "\n",
    "## Challenges in evaluating LLMs\n",
    "\n",
    "Evaluating Large Language Models (LLMs) comes with its own set of challenges:\n",
    "\n",
    "- **Lack of Standardization**: There is no single, universally accepted evaluation framework or metrics suite for LLMs. This makes it difficult to compare and benchmark different models across various tasks. There are a number of benchmark datasets for evaluating LLM performance such as [GSM8K](https://huggingface.co/datasets/openai/gsm8k), however these are not always representative of real-world performance.\n",
    "\n",
    "- **Complexity of Evaluation Tasks**: Many evaluation tasks are complex and multifaceted, involving multiple aspects such as factual accuracy, coherence, fluency, and relevance. Additionally, LLMs are prone to hallucination meaning that the final response may deviate from some provided, factually correct context.\n",
    "\n",
    "- **Subjectivity in Evaluation**: Evaluation metrics and tasks can be subjective, making it challenging to determine the \"ground truth\" or what constitutes a correct answer. This subjectivity can lead to varying evaluation results across different evaluators or even the same evaluator at different times.\n",
    "\n",
    "- **Human Judgement Resources**: Limited human judgement resources make large-scale evaluations using human judgements impractical. This limitation highlights the need for automated and scalable evaluation methods.\n",
    "\n",
    "## Metrics overview\n",
    "\n",
    "Open source frameworks such as [Deepeval](https://docs.confident-ai.com/docs/getting-started) offer a range of metrics to evaluate various aspects of an LLM's output. In particular, the following metrics are related to comparing the LLM's response to some provided context (like from a RAG system) to ensure the response is high quality, factually correct, and representative of the external knowledge base.\n",
    "\n",
    "This example uses the following metrics:\n",
    "- **[Answer Relevancy](https://docs.confident-ai.com/docs/metrics-answer-relevancy)**: Measures the quality of an LLM's generator by evaluating how relevant the actual output is compared to the provided input.\n",
    "- **[Faithfulness](https://docs.confident-ai.com/docs/metrics-faithfulness)**: Evaluates whether the actual output factually aligns with the contents of the retrieval context.\n",
    "- **[Contextual Precision](https://docs.confident-ai.com/docs/metrics-contextual-precision)**: Assesses the LLM's retriever by evaluating whether nodes in the retrieval context that are relevant to the given input are ranked higher than irrelevant ones.\n",
    "- **[Contextual Recall](https://docs.confident-ai.com/docs/metrics-contextual-recall)**: Measures the quality of an LLM's retriever by evaluating the extent to which the retrieval context aligns with the expected output.\n",
    "- **[Contextual Relevancy](https://docs.confident-ai.com/docs/metrics-contextual-relevancy)**: Evaluates the overall relevance of the information presented in the retrieval context for a given input.\n",
    "\n",
    "## Prerequisite\n",
    "\n",
    "Install the required packages by running this command (one time only)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "384ed3fd-d9c7-4377-8dad-88a3b665d115",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# %pip install --upgrade deepeval==2.5.5 \"protobuf<3.20\" mlrun transformers torch torchvision lm-format-enforcer=0.10.12"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a78c352-f98b-4104-a5b6-b95bfe01eee4",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d6fa7d03-0b6c-4e71-9956-274c2068f699",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"DEEPEVAL_UPDATE_WARNING_OPT_OUT\"] = \"YES\"\n",
    "\n",
    "import json\n",
    "import mlrun\n",
    "import pandas as pd\n",
    "from deepeval import evaluate\n",
    "from deepeval.metrics import (\n",
    "    AnswerRelevancyMetric,\n",
    "    ContextualPrecisionMetric,\n",
    "    ContextualRecallMetric,\n",
    "    ContextualRelevancyMetric,\n",
    "    FaithfulnessMetric,\n",
    ")\n",
    "from deepeval.models.base_model import DeepEvalBaseLLM\n",
    "from deepeval.test_case import LLMTestCase\n",
    "from mlrun.utils import create_class\n",
    "from pydantic import BaseModel\n",
    "from lmformatenforcer import JsonSchemaParser\n",
    "from lmformatenforcer.integrations.transformers import (\n",
    "    build_transformers_prefix_allowed_tokens_fn,\n",
    ")\n",
    "import transformers\n",
    "from deepeval.metrics import JsonCorrectnessMetric\n",
    "import torch\n",
    "\n",
    "\n",
    "class QwenDeepEvalBaseLLM(DeepEvalBaseLLM):\n",
    "    def __init__(self, model_name: str, model, device=None):\n",
    "        self.model_name = model_name\n",
    "        self.device = (\n",
    "            device if device else (\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "        )\n",
    "        self.model = model\n",
    "\n",
    "    def load_model(self):\n",
    "        \"\"\"Return the loaded model.\"\"\"\n",
    "        return self.model\n",
    "\n",
    "    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:\n",
    "        \"\"\"Generate a response based on the input prompt.\"\"\"\n",
    "        # inputs = self.model.tokenizer(prompt, return_tensors=\"pt\").to(self.device)\n",
    "        with torch.no_grad():\n",
    "            outputs = self.model(prompt)\n",
    "        # Output and load valid JSON\n",
    "        # output = self.tokenizer.decode(outputs[0], prefix_allowed_tokens_fn=prefix_function,skip_special_tokens=True)\n",
    "        parser = JsonSchemaParser(schema.model_json_schema())\n",
    "        prefix_function = build_transformers_prefix_allowed_tokens_fn(\n",
    "            self.model.tokenizer, parser\n",
    "        )\n",
    "        output_dict = self.model(prompt, prefix_allowed_tokens_fn=prefix_function)\n",
    "        output = output_dict[0][\"generated_text\"][len(prompt) :]\n",
    "        json_result = json.loads(output)\n",
    "\n",
    "        # Return valid JSON object according to the schema DeepEval supplied\n",
    "        return schema(**json_result)\n",
    "\n",
    "    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:\n",
    "        \"\"\"Asynchronous version of the generate method.\"\"\"\n",
    "        return self.generate(prompt, schema)\n",
    "\n",
    "    def get_model_name(self) -> str:\n",
    "        \"\"\"Return the name of the model.\"\"\"\n",
    "        return self.model_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f264a56-f226-46c2-8425-47b4f60c953e",
   "metadata": {},
   "source": [
    "## Select OpenAI or Qwen Model\n",
    "* Support OpenAi gpt-4o or Qwen2-0.5B \n",
    "* Use Qwen2-0.5B for local and simple tests and CPU environments "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a54e0b59-2edb-4550-a55e-8a820f5d05e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "MODE = \"qwen\"  # or openai\n",
    "\n",
    "# If using openai mode, set the API key and base URL\n",
    "OPENAI_API_KEY = \"\"\n",
    "OPENAI_BASE_URL = \"https://api.openai.com/v1\"\n",
    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\n",
    "os.environ[\"OPENAI_BASE_URL\"] = OPENAI_BASE_URL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "837bffe3-281a-4c01-9b6d-9cb31a5639bb",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.\n",
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using mode: QWEN\n",
      "\n",
      "LLM model: Qwen/Qwen2-0.5B\n"
     ]
    }
   ],
   "source": [
    "if MODE == \"openai\":\n",
    "    name = model = \"gpt-4o\"\n",
    "if MODE == \"qwen\":\n",
    "    name = \"Qwen/Qwen2-0.5B\"\n",
    "    model = transformers.pipeline(\n",
    "        \"text-generation\",\n",
    "        model=\"Qwen/Qwen2-0.5B\",\n",
    "        framework=\"pt\",\n",
    "        device_map=\"cpu\",\n",
    "        do_sample=True,\n",
    "        num_return_sequences=1,\n",
    "        max_new_tokens=10000,\n",
    "    )\n",
    "    model = QwenDeepEvalBaseLLM(model_name=\"Qwen/Qwen2-0.5B\", model=model, device=\"cpu\")\n",
    "\n",
    "\n",
    "print(f\"Using mode: {MODE.upper()}\\n\")\n",
    "print(f\"LLM model: {name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8efdbea5-733f-4828-890a-fa7a6d4eaeb9",
   "metadata": {},
   "source": [
    "## Example evaluation task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "906daacb-4dd8-471b-a324-9304eb77a60d",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">✨ You're running DeepEval's latest <span style=\"color: #6a00ff; text-decoration-color: #6a00ff\">Faithfulness Metric</span>! <span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">(</span><span style=\"color: #374151; text-decoration-color: #374151\">using Qwen/Qwen2-</span><span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">0.</span><span style=\"color: #374151; text-decoration-color: #374151\">5B, </span><span style=\"color: #374151; text-decoration-color: #374151\">strict</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">False</span><span style=\"color: #374151; text-decoration-color: #374151\">, </span><span style=\"color: #374151; text-decoration-color: #374151\">async_mode</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">True</span><span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">)</span><span style=\"color: #374151; text-decoration-color: #374151\">...</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "✨ You're running DeepEval's latest \u001b[38;2;106;0;255mFaithfulness Metric\u001b[0m! \u001b[1;38;2;55;65;81m(\u001b[0m\u001b[38;2;55;65;81musing Qwen/Qwen2-\u001b[0m\u001b[1;38;2;55;65;81m0.\u001b[0m\u001b[38;2;55;65;81m5B, \u001b[0m\u001b[38;2;55;65;81mstrict\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mFalse\u001b[0m\u001b[38;2;55;65;81m, \u001b[0m\u001b[38;2;55;65;81masync_mode\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mTrue\u001b[0m\u001b[1;38;2;55;65;81m)\u001b[0m\u001b[38;2;55;65;81m...\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Event loop is already running. Applying nest_asyncio patch to allow async execution...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 02:23, 143.41s/test case]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "======================================================================\n",
      "\n",
      "Metrics Summary\n",
      "\n",
      "  - ✅ Faithfulness (score: 0.5, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is.50 because The actual output [Your Output] doesn't match the retrieval context [Your Retrieval Context] due to the following contradictions: \n",
      "- False Positive [Your Output], \n",
      "- False Negative [Your Output], \n",
      "- False Controversial [Your Output], \n",
      "- False Controversial [Your Retrieval Context], and \n",
      "- False Controversial [Your Output] when compared to the actual output of [Your Output] using their score of [Your Retrieval Score ] based on information in their respective columns. They [Your Retrieval Context] should have 10% more confidence in [Your Output] compared to [Your Retrieval Context]., error: None)\n",
      "\n",
      "For test case:\n",
      "\n",
      "  - input: I'm on an F-1 visa, how long can I stay in the US after graduation?\n",
      "  - actual output: You can stay up to 30 days after completing your degree.\n",
      "  - expected output: You can stay up to 60 days after completing your degree.\n",
      "  - context: None\n",
      "  - retrieval context: ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\\n        your degree, unless you have applied for and been approved to participate in OPT.']\n",
      "\n",
      "======================================================================\n",
      "\n",
      "Overall Metric Pass Rates\n",
      "\n",
      "Faithfulness: 100.00% pass rate\n",
      "\n",
      "======================================================================\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
       "<span style=\"color: #05f58d; text-decoration-color: #05f58d\">✓</span> Tests finished 🎉! Run <span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">'deepeval login'</span> to save and analyze evaluation results on Confident AI.\n",
       " \n",
       "✨👀 Looking for a place for your LLM test data to live 🏡❤️ ? Use <span style=\"color: #6a00ff; text-decoration-color: #6a00ff\">Confident AI</span> to get &amp; share testing reports, \n",
       "experiment with models/prompts, and catch regressions for your LLM system. Just run <span style=\"color: #008080; text-decoration-color: #008080\">'deepeval login'</span> in the CLI. \n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\n",
       "\u001b[38;2;5;245;141m✓\u001b[0m Tests finished 🎉! Run \u001b[1;32m'deepeval login'\u001b[0m to save and analyze evaluation results on Confident AI.\n",
       " \n",
       "✨👀 Looking for a place for your LLM test data to live 🏡❤️ ? Use \u001b[38;2;106;0;255mConfident AI\u001b[0m to get & share testing reports, \n",
       "experiment with models/prompts, and catch regressions for your LLM system. Just run \u001b[36m'deepeval login'\u001b[0m in the CLI. \n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "test_case = LLMTestCase(\n",
    "    input=\"I'm on an F-1 visa, how long can I stay in the US after graduation?\",\n",
    "    actual_output=\"You can stay up to 30 days after completing your degree.\",\n",
    "    expected_output=\"You can stay up to 60 days after completing your degree.\",\n",
    "    retrieval_context=[\n",
    "        \"\"\"If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\n",
    "        your degree, unless you have applied for and been approved to participate in OPT.\"\"\"\n",
    "    ],\n",
    ")\n",
    "\n",
    "faithfulness = FaithfulnessMetric(model=model)\n",
    "\n",
    "results = evaluate(test_cases=[test_case], metrics=[faithfulness], use_cache=True)"
   ]
  },
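  {
   "cell_type": "markdown",
   "id": "7c1d9a2e",
   "metadata": {},
   "source": [
    "Beyond the printed summary, the object returned by `evaluate` can also be inspected programmatically. A minimal sketch, assuming DeepEval 2.x, where the returned object exposes a `test_results` list whose items carry `metrics_data` (the same attributes the evaluation function below relies on):\n",
    "\n",
    "```python\n",
    "# Inspect each metric result for the first test case\n",
    "for metric_data in results.test_results[0].metrics_data:\n",
    "    print(metric_data.name, metric_data.score, metric_data.success)\n",
    "```"
   ]
  },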
  {
   "cell_type": "markdown",
   "id": "b3318aca-2273-457f-903c-b6b6d066aca3",
   "metadata": {},
   "source": [
    "## Create an evaluation function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6bea9fe2-f26d-4eaa-877d-bed0e8c34ecf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting evaluate_llm.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile evaluate_llm.py\n",
    "\n",
    "import os\n",
    "\n",
    "import mlrun\n",
    "import pandas as pd\n",
    "from deepeval import evaluate\n",
    "from deepeval.models.base_model import DeepEvalBaseLLM\n",
    "from deepeval.test_case import LLMTestCase\n",
    "from mlrun.utils import create_class\n",
    "\n",
    "def evaluate_llm(\n",
    "    test_cases: list[dict],\n",
    "    metrics: list[str],\n",
    "    model: str\n",
    "):\n",
    "    results = evaluate(\n",
    "        test_cases=[LLMTestCase(**t) for t in test_cases],\n",
    "        metrics=[create_class(m)(model=model) for m in metrics],\n",
    "        use_cache=True\n",
    "    )\n",
    "    \n",
    "    rows = []\n",
    "    for i, result in enumerate(results.test_results):\n",
    "        for metric in result.metrics_data:\n",
    "            result_dict = {\n",
    "                \"test\" : f\"test_case_{i}\",\n",
    "                \"actual_output\" : result.actual_output,\n",
    "                \"expected_output\" : result.expected_output,\n",
    "                \"context\" : result.context,\n",
    "                \"retrieval_context\" : result.retrieval_context,\n",
    "                \"user_input\" : result.input,\n",
    "                \"test_success\" : result.success,\n",
    "                \"metric_success\" : metric.success,\n",
    "                \"metric\" : metric.name,\n",
    "                \"evaluation_model\" : metric.evaluation_model,\n",
    "                \"metric_score\" : metric.score,\n",
    "                \"metric_reason\" : metric.reason,\n",
    "                \"evaluation_cost\": metric.evaluation_cost,\n",
    "                \"metric_threshold\" : metric.threshold,\n",
    "                \"metric_error\" : metric.error\n",
    "            }\n",
    "            rows.append(result_dict)\n",
    "    df = pd.DataFrame(rows)\n",
    "    return df"
   ]
  },
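  {
   "cell_type": "markdown",
   "id": "9b4e7f13",
   "metadata": {},
   "source": [
    "The handler returns a flat DataFrame with one row per test-case/metric pair, so summaries such as the per-metric pass rate can be derived with an ordinary pandas group-by. A sketch over synthetic rows (column names match `evaluate_llm` above; the values are made up for illustration):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Synthetic rows in the same shape evaluate_llm returns\n",
    "df = pd.DataFrame([\n",
    "    {\"test\": \"test_case_0\", \"metric\": \"Faithfulness\", \"metric_success\": True},\n",
    "    {\"test\": \"test_case_1\", \"metric\": \"Faithfulness\", \"metric_success\": False},\n",
    "])\n",
    "pass_rates = df.groupby(\"metric\")[\"metric_success\"].mean()  # Faithfulness -> 0.5\n",
    "```"
   ]
  },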
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f3ff847c-332b-4455-8dd2-74440f36c7be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "> 2025-04-22 09:21:40,230 [info] Project loaded successfully: {\"project_name\":\"evaluate\"}\n"
     ]
    }
   ],
   "source": [
    "project = mlrun.get_or_create_project(\"evaluate\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3ad6846-ec67-4bc5-8c4e-0a774176a0d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluation_fn = project.set_function(\n",
    "    name=\"evaluate-llm\",\n",
    "    func=\"evaluate_llm.py\",\n",
    "    kind=\"job\",\n",
    "    image=\"mlrun/mlrun\",\n",
    "    handler=\"evaluate_llm\",\n",
    ")\n",
    "\n",
    "# Store OpenAI credentials as k8s secrets\n",
    "if MODE == \"openai\":\n",
    "    project.set_secrets(\n",
    "        {\"OPENAI_API_KEY\": OPENAI_API_KEY, \"OPENAI_BASE_URL\": OPENAI_BASE_URL}\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "81b348e2-a133-49e1-aea3-2a05c34c6ac3",
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics = [\n",
    "    \"deepeval.metrics.AnswerRelevancyMetric\",\n",
    "    \"deepeval.metrics.FaithfulnessMetric\",\n",
    "    \"deepeval.metrics.ContextualPrecisionMetric\",\n",
    "    \"deepeval.metrics.ContextualRecallMetric\",\n",
    "    \"deepeval.metrics.ContextualRelevancyMetric\",\n",
    "]\n",
    "\n",
    "if MODE == \"qwen\":\n",
    "    metrics = [\n",
    "        \"deepeval.metrics.AnswerRelevancyMetric\",\n",
    "        \"deepeval.metrics.FaithfulnessMetric\",\n",
    "    ]"
   ]
  },
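  {
   "cell_type": "markdown",
   "id": "5a8c2d47",
   "metadata": {},
   "source": [
    "Each entry is a fully qualified class path that the handler resolves at runtime with `mlrun.utils.create_class`. The underlying dynamic-import pattern can be sketched with the standard library alone (a hypothetical helper shown for illustration, not MLRun's exact implementation):\n",
    "\n",
    "```python\n",
    "import importlib\n",
    "\n",
    "def load_class(path: str):\n",
    "    # Split \"pkg.module.ClassName\" into module path and class name\n",
    "    module_path, class_name = path.rsplit(\".\", 1)\n",
    "    return getattr(importlib.import_module(module_path), class_name)\n",
    "\n",
    "# e.g. load_class(\"collections.OrderedDict\") returns the OrderedDict class\n",
    "```"
   ]
  },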
  {
   "cell_type": "markdown",
   "id": "d7e940fa-ca0d-458e-8cf7-0f7288df4ea3",
   "metadata": {},
   "source": [
    "## Run an evaluation job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "45b1e281-195c-40c5-b4d7-d5f128a77037",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "> 2025-04-22 09:21:41,057 [info] Storing function: {\"db\":\"http://mlrun-api:8080\",\"name\":\"evaluate-llm-evaluate-llm\",\"uid\":\"545dd486f49b4f8aae6e11d59d7f191d\"}\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">✨ You're running DeepEval's latest <span style=\"color: #6a00ff; text-decoration-color: #6a00ff\">Answer Relevancy Metric</span>! <span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">(</span><span style=\"color: #374151; text-decoration-color: #374151\">using Qwen/Qwen2-</span><span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">0.</span><span style=\"color: #374151; text-decoration-color: #374151\">5B, </span><span style=\"color: #374151; text-decoration-color: #374151\">strict</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">False</span><span style=\"color: #374151; text-decoration-color: #374151\">, </span>\n",
       "<span style=\"color: #374151; text-decoration-color: #374151\">async_mode</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">True</span><span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">)</span><span style=\"color: #374151; text-decoration-color: #374151\">...</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "✨ You're running DeepEval's latest \u001b[38;2;106;0;255mAnswer Relevancy Metric\u001b[0m! \u001b[1;38;2;55;65;81m(\u001b[0m\u001b[38;2;55;65;81musing Qwen/Qwen2-\u001b[0m\u001b[1;38;2;55;65;81m0.\u001b[0m\u001b[38;2;55;65;81m5B, \u001b[0m\u001b[38;2;55;65;81mstrict\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mFalse\u001b[0m\u001b[38;2;55;65;81m, \u001b[0m\n",
       "\u001b[38;2;55;65;81masync_mode\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mTrue\u001b[0m\u001b[1;38;2;55;65;81m)\u001b[0m\u001b[38;2;55;65;81m...\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">✨ You're running DeepEval's latest <span style=\"color: #6a00ff; text-decoration-color: #6a00ff\">Faithfulness Metric</span>! <span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">(</span><span style=\"color: #374151; text-decoration-color: #374151\">using Qwen/Qwen2-</span><span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">0.</span><span style=\"color: #374151; text-decoration-color: #374151\">5B, </span><span style=\"color: #374151; text-decoration-color: #374151\">strict</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">False</span><span style=\"color: #374151; text-decoration-color: #374151\">, </span><span style=\"color: #374151; text-decoration-color: #374151\">async_mode</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">True</span><span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">)</span><span style=\"color: #374151; text-decoration-color: #374151\">...</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "✨ You're running DeepEval's latest \u001b[38;2;106;0;255mFaithfulness Metric\u001b[0m! \u001b[1;38;2;55;65;81m(\u001b[0m\u001b[38;2;55;65;81musing Qwen/Qwen2-\u001b[0m\u001b[1;38;2;55;65;81m0.\u001b[0m\u001b[38;2;55;65;81m5B, \u001b[0m\u001b[38;2;55;65;81mstrict\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mFalse\u001b[0m\u001b[38;2;55;65;81m, \u001b[0m\u001b[38;2;55;65;81masync_mode\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mTrue\u001b[0m\u001b[1;38;2;55;65;81m)\u001b[0m\u001b[38;2;55;65;81m...\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Event loop is already running. Applying nest_asyncio patch to allow async execution...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Evaluating 2 test case(s) in parallel: |██████████|100% (2/2) [Time Taken: 05:12, 156.17s/test case]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "======================================================================\n",
      "\n",
      "Metrics Summary\n",
      "\n",
      "  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is 1.00 because there is no relevant statement in the actual output that has >30 words, error: None)\n",
      "  - ❌ Faithfulness (score: 0.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is 0.00 because the actual output does not align with the retrieval context, error: None)\n",
      "\n",
      "For test case:\n",
      "\n",
      "  - input: What are some benefits of MLRun?\n",
      "  - actual output: MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.\n",
      "  - expected output: MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.\n",
      "  - context: None\n",
      "  - retrieval context: ['Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.', 'MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple \"local\" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.']\n",
      "\n",
      "======================================================================\n",
      "\n",
      "Metrics Summary\n",
      "\n",
      "  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: I'm on an F-1 visa, how long can I stay in the US after graduation? I need to apply for a US visa and my time in the US should not exceed two years to give up the F-1 visa, after graduation. Thank you, error: None)\n",
      "  - ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is <faithfulness_score> because <your_reason>.\n",
      "\n",
      ", error: None)\n",
      "\n",
      "For test case:\n",
      "\n",
      "  - input: I'm on an F-1 visa, how long can I stay in the US after graduation?\n",
      "  - actual output: You can stay up to 30 days after completing your degree.\n",
      "  - expected output: You can stay up to 60 days after completing your degree.\n",
      "  - context: None\n",
      "  - retrieval context: ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\\n                    your degree, unless you have applied for and been approved to participate in OPT.']\n",
      "\n",
      "======================================================================\n",
      "\n",
      "Overall Metric Pass Rates\n",
      "\n",
      "Answer Relevancy: 100.00% pass rate\n",
      "Faithfulness: 50.00% pass rate\n",
      "\n",
      "======================================================================\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
       "<span style=\"color: #05f58d; text-decoration-color: #05f58d\">✓</span> Tests finished 🎉! Run <span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">'deepeval login'</span> to save and analyze evaluation results on Confident AI.\n",
       " \n",
       "✨👀 Looking for a place for your LLM test data to live 🏡❤️ ? Use <span style=\"color: #6a00ff; text-decoration-color: #6a00ff\">Confident AI</span> to get &amp; share testing reports, \n",
       "experiment with models/prompts, and catch regressions for your LLM system. Just run <span style=\"color: #008080; text-decoration-color: #008080\">'deepeval login'</span> in the CLI. \n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\n",
       "\u001b[38;2;5;245;141m✓\u001b[0m Tests finished 🎉! Run \u001b[1;32m'deepeval login'\u001b[0m to save and analyze evaluation results on Confident AI.\n",
       " \n",
       "✨👀 Looking for a place for your LLM test data to live 🏡❤️ ? Use \u001b[38;2;106;0;255mConfident AI\u001b[0m to get & share testing reports, \n",
       "experiment with models/prompts, and catch regressions for your LLM system. Just run \u001b[36m'deepeval login'\u001b[0m in the CLI. \n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Converting input from bool to <class 'numpy.uint8'> for compatibility.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<style>\n",
       ".dictlist {\n",
       "  background-color: #4EC64B;\n",
       "  text-align: center;\n",
       "  margin: 4px;\n",
       "  border-radius: 3px; padding: 0px 3px 1px 3px; display: inline-block;}\n",
       ".artifact {\n",
       "  cursor: pointer;\n",
       "  background-color: #4EC64B;\n",
       "  text-align: left;\n",
       "  margin: 4px; border-radius: 3px; padding: 0px 3px 1px 3px; display: inline-block;\n",
       "}\n",
       "div.block.hidden {\n",
       "  display: none;\n",
       "}\n",
       ".clickable {\n",
       "  cursor: pointer;\n",
       "}\n",
       ".ellipsis {\n",
       "  display: inline-block;\n",
       "  max-width: 60px;\n",
       "  white-space: nowrap;\n",
       "  overflow: hidden;\n",
       "  text-overflow: ellipsis;\n",
       "}\n",
       ".master-wrapper {\n",
       "  display: flex;\n",
       "  flex-flow: row nowrap;\n",
       "  justify-content: flex-start;\n",
       "  align-items: stretch;\n",
       "}\n",
       ".master-tbl {\n",
       "  flex: 3\n",
       "}\n",
       ".master-wrapper > div {\n",
       "  margin: 4px;\n",
       "  padding: 10px;\n",
       "}\n",
       "iframe.fileview {\n",
       "  border: 0 none;\n",
       "  height: 100%;\n",
       "  width: 100%;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       ".pane-header-title {\n",
       "  width: 80%;\n",
       "  font-weight: 500;\n",
       "}\n",
       ".pane-header {\n",
       "  line-height: 1;\n",
       "  background-color: #4EC64B;\n",
       "  padding: 3px;\n",
       "}\n",
       ".pane-header .close {\n",
       "  font-size: 20px;\n",
       "  font-weight: 700;\n",
       "  float: right;\n",
       "  margin-top: -5px;\n",
       "}\n",
       ".master-wrapper .right-pane {\n",
       "  border: 1px inset silver;\n",
       "  width: 40%;\n",
       "  min-height: 300px;\n",
        "  flex: 3;\n",
       "  min-width: 500px;\n",
       "}\n",
       ".master-wrapper * {\n",
       "  box-sizing: border-box;\n",
       "}\n",
       "</style><script>\n",
       "function copyToClipboard(fld) {\n",
       "    if (document.queryCommandSupported && document.queryCommandSupported('copy')) {\n",
       "        var textarea = document.createElement('textarea');\n",
       "        textarea.textContent = fld.innerHTML;\n",
       "        textarea.style.position = 'fixed';\n",
       "        document.body.appendChild(textarea);\n",
       "        textarea.select();\n",
       "\n",
       "        try {\n",
       "            return document.execCommand('copy'); // Security exception may be thrown by some browsers.\n",
       "        } catch (ex) {\n",
       "\n",
       "        } finally {\n",
       "            document.body.removeChild(textarea);\n",
       "        }\n",
       "    }\n",
       "}\n",
       "function expandPanel(el) {\n",
       "  const panelName = \"#\" + el.getAttribute('paneName');\n",
       "\n",
       "  // Get the base URL of the current notebook\n",
       "  var baseUrl = window.location.origin;\n",
       "\n",
       "  // Construct the full URL\n",
       "  var fullUrl = new URL(el.title, baseUrl).href;\n",
       "\n",
       "  document.querySelector(panelName + \"-title\").innerHTML = fullUrl\n",
       "  iframe = document.querySelector(panelName + \"-body\");\n",
       "\n",
       "  const tblcss = `<style> body { font-family: Arial, Helvetica, sans-serif;}\n",
       "    #csv { margin-bottom: 15px; }\n",
       "    #csv table { border-collapse: collapse;}\n",
       "    #csv table td { padding: 4px 8px; border: 1px solid silver;} </style>`;\n",
       "\n",
       "  function csvToHtmlTable(str) {\n",
       "    return '<div id=\"csv\"><table><tr><td>' +  str.replace(/[\\n\\r]+$/g, '').replace(/[\\n\\r]+/g, '</td></tr><tr><td>')\n",
       "      .replace(/,/g, '</td><td>') + '</td></tr></table></div>';\n",
       "  }\n",
       "\n",
       "  function reqListener () {\n",
       "    if (fullUrl.endsWith(\".csv\")) {\n",
       "      iframe.setAttribute(\"srcdoc\", tblcss + csvToHtmlTable(this.responseText));\n",
       "    } else {\n",
       "      iframe.setAttribute(\"srcdoc\", this.responseText);\n",
       "    }\n",
       "    console.log(this.responseText);\n",
       "  }\n",
       "\n",
       "  const oReq = new XMLHttpRequest();\n",
       "  oReq.addEventListener(\"load\", reqListener);\n",
       "  oReq.open(\"GET\", fullUrl);\n",
       "  oReq.send();\n",
       "\n",
       "\n",
       "  //iframe.src = fullUrl;\n",
       "  const resultPane = document.querySelector(panelName + \"-pane\");\n",
       "  if (resultPane.classList.contains(\"hidden\")) {\n",
       "    resultPane.classList.remove(\"hidden\");\n",
       "  }\n",
       "}\n",
       "function closePanel(el) {\n",
       "  const panelName = \"#\" + el.getAttribute('paneName')\n",
       "  const resultPane = document.querySelector(panelName + \"-pane\");\n",
       "  if (!resultPane.classList.contains(\"hidden\")) {\n",
       "    resultPane.classList.add(\"hidden\");\n",
       "  }\n",
       "}\n",
       "\n",
       "</script>\n",
       "<div class=\"master-wrapper\">\n",
       "  <div class=\"block master-tbl\"><div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>project</th>\n",
       "      <th>uid</th>\n",
       "      <th>iter</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>state</th>\n",
       "      <th>kind</th>\n",
       "      <th>name</th>\n",
       "      <th>labels</th>\n",
       "      <th>inputs</th>\n",
       "      <th>parameters</th>\n",
       "      <th>results</th>\n",
       "      <th>artifact_uris</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>evaluate</td>\n",
       "      <td><div title=\"545dd486f49b4f8aae6e11d59d7f191d\"><a href=\"https://dashboard.default-tenant.app.innovation-dev.iguazio-cd2.com/mlprojects/evaluate/jobs/monitor/545dd486f49b4f8aae6e11d59d7f191d/overview\" target=\"_blank\" >...9d7f191d</a></div></td>\n",
       "      <td>0</td>\n",
       "      <td>Apr 22 09:21:41</td>\n",
       "      <td>NaT</td>\n",
       "      <td>completed</td>\n",
       "      <td>run</td>\n",
       "      <td>evaluate-llm-evaluate-llm</td>\n",
       "      <td><div class=\"dictlist\">v3io_user=shapira</div><div class=\"dictlist\">kind=local</div><div class=\"dictlist\">owner=shapira</div><div class=\"dictlist\">host=jupyter-shapira-665ddf954b-jscr6</div></td>\n",
       "      <td></td>\n",
        "      <td><div class=\"dictlist\">test_cases=[{'input': \"I'm on an F-1 visa, how long can I stay in the US after graduation?\", 'actual_output': 'You can stay up to 30 days after completing your degree.', 'expected_output': 'You can stay up to 60 days after completing your degree.', 'retrieval_context': ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\\n                    your degree, unless you have applied for and been approved to participate in OPT.']}, {'input': 'What are some benefits of MLRun?', 'actual_output': 'MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.', 'expected_output': 'MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.', 'retrieval_context': ['Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.', 'MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple \"local\" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.']}]</div><div class=\"dictlist\">metrics=['deepeval.metrics.AnswerRelevancyMetric', 'deepeval.metrics.FaithfulnessMetric']</div><div class=\"dictlist\">model=<__main__.QwenDeepEvalBaseLLM object at 0x7f550150b520></div></td>\n",
       "      <td></td>\n",
       "      <td><div class=\"dictlist\">evaluation=store://datasets/evaluate/evaluate-llm-evaluate-llm_evaluation#0@545dd486f49b4f8aae6e11d59d7f191d^fb1d004984e59cf6958022ebe095bd251d91e154</div></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div></div>\n",
       "  <div id=\"resultbee7d011-pane\" class=\"right-pane block hidden\">\n",
       "    <div class=\"pane-header\">\n",
       "      <span id=\"resultbee7d011-title\" class=\"pane-header-title\">Title</span>\n",
       "      <span onclick=\"closePanel(this)\" paneName=\"resultbee7d011\" class=\"close clickable\">&times;</span>\n",
       "    </div>\n",
       "    <iframe class=\"fileview\" id=\"resultbee7d011-body\"></iframe>\n",
       "  </div>\n",
       "</div>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<b> > to track results use the .show() or .logs() methods  or <a href=\"https://dashboard.default-tenant.app.innovation-dev.iguazio-cd2.com/mlprojects/evaluate/jobs/monitor-jobs/evaluate-llm-evaluate-llm/545dd486f49b4f8aae6e11d59d7f191d/overview\" target=\"_blank\">click here</a> to open in UI</b>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "> 2025-04-22 09:26:55,309 [info] Run execution finished: {\"name\":\"evaluate-llm-evaluate-llm\",\"status\":\"completed\"}\n",
      "CPU times: user 29min 36s, sys: 8.89 s, total: 29min 45s\n",
      "Wall time: 5min 15s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "evaluation_run = project.run_function(\n",
    "    evaluation_fn,\n",
    "    params={\n",
    "        \"test_cases\": [\n",
    "            dict(\n",
    "                input=\"I'm on an F-1 visa, how long can I stay in the US after graduation?\",\n",
    "                actual_output=\"You can stay up to 30 days after completing your degree.\",\n",
    "                expected_output=\"You can stay up to 60 days after completing your degree.\",\n",
    "                retrieval_context=[\n",
    "                    \"\"\"If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\n",
    "                    your degree, unless you have applied for and been approved to participate in OPT.\"\"\"\n",
    "                ],\n",
    "            ),\n",
    "            dict(\n",
    "                input=\"What are some benefits of MLRun?\",\n",
    "                actual_output=\"MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.\",\n",
    "                expected_output=\"MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.\",\n",
    "                retrieval_context=[\n",
    "                    \"\"\"Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.\"\"\",\n",
    "                    \"\"\"MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple \"local\" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.\"\"\",\n",
    "                ],\n",
    "            ),\n",
    "        ],\n",
    "        \"metrics\": metrics,\n",
    "        \"model\": model,\n",
    "    },\n",
    "    local=True,\n",
    "    outputs=[\"evaluation\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15df7c97-449f-47f2-95a1-96fe2df81e8c",
   "metadata": {
    "tags": []
   },
   "source": [
    "## View the logged output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7ee909d8-89b6-41da-88a9-64ccc8ecc5e4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test</th>\n",
       "      <th>actual_output</th>\n",
       "      <th>expected_output</th>\n",
       "      <th>context</th>\n",
       "      <th>retrieval_context</th>\n",
       "      <th>user_input</th>\n",
       "      <th>test_success</th>\n",
       "      <th>metric_success</th>\n",
       "      <th>metric</th>\n",
       "      <th>evaluation_model</th>\n",
       "      <th>metric_score</th>\n",
       "      <th>metric_reason</th>\n",
       "      <th>evaluation_cost</th>\n",
       "      <th>metric_threshold</th>\n",
       "      <th>metric_error</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>test_case_0</td>\n",
       "      <td>MLRun is an MLOps orchestration framework that...</td>\n",
       "      <td>MLRun is an open MLOps platform for quickly bu...</td>\n",
       "      <td>None</td>\n",
       "      <td>[Instead of a siloed, complex, and manual proc...</td>\n",
       "      <td>What are some benefits of MLRun?</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>Answer Relevancy</td>\n",
       "      <td>Qwen/Qwen2-0.5B</td>\n",
       "      <td>1.0</td>\n",
       "      <td>The score is 1.00 because there is no relevant...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>test_case_0</td>\n",
       "      <td>MLRun is an MLOps orchestration framework that...</td>\n",
       "      <td>MLRun is an open MLOps platform for quickly bu...</td>\n",
       "      <td>None</td>\n",
       "      <td>[Instead of a siloed, complex, and manual proc...</td>\n",
       "      <td>What are some benefits of MLRun?</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Faithfulness</td>\n",
       "      <td>Qwen/Qwen2-0.5B</td>\n",
       "      <td>0.0</td>\n",
       "      <td>The score is 0.00 because the actual output do...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>test_case_1</td>\n",
       "      <td>You can stay up to 30 days after completing yo...</td>\n",
       "      <td>You can stay up to 60 days after completing yo...</td>\n",
       "      <td>None</td>\n",
       "      <td>[If you are in the U.S. on an F-1 visa, you ar...</td>\n",
       "      <td>I'm on an F-1 visa, how long can I stay in the...</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>Answer Relevancy</td>\n",
       "      <td>Qwen/Qwen2-0.5B</td>\n",
       "      <td>1.0</td>\n",
       "      <td>I'm on an F-1 visa, how long can I stay in the...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.5</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>test_case_1</td>\n",
       "      <td>You can stay up to 30 days after completing yo...</td>\n",
       "      <td>You can stay up to 60 days after completing yo...</td>\n",
       "      <td>None</td>\n",
       "      <td>[If you are in the U.S. on an F-1 visa, you ar...</td>\n",
       "      <td>I'm on an F-1 visa, how long can I stay in the...</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>Faithfulness</td>\n",
       "      <td>Qwen/Qwen2-0.5B</td>\n",
       "      <td>1.0</td>\n",
       "      <td>The score is &lt;faithfulness_score&gt; because &lt;you...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.5</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          test                                      actual_output  \\\n",
       "0  test_case_0  MLRun is an MLOps orchestration framework that...   \n",
       "1  test_case_0  MLRun is an MLOps orchestration framework that...   \n",
       "2  test_case_1  You can stay up to 30 days after completing yo...   \n",
       "3  test_case_1  You can stay up to 30 days after completing yo...   \n",
       "\n",
       "                                     expected_output context  \\\n",
       "0  MLRun is an open MLOps platform for quickly bu...    None   \n",
       "1  MLRun is an open MLOps platform for quickly bu...    None   \n",
       "2  You can stay up to 60 days after completing yo...    None   \n",
       "3  You can stay up to 60 days after completing yo...    None   \n",
       "\n",
       "                                   retrieval_context  \\\n",
       "0  [Instead of a siloed, complex, and manual proc...   \n",
       "1  [Instead of a siloed, complex, and manual proc...   \n",
       "2  [If you are in the U.S. on an F-1 visa, you ar...   \n",
       "3  [If you are in the U.S. on an F-1 visa, you ar...   \n",
       "\n",
       "                                          user_input  test_success  \\\n",
       "0                   What are some benefits of MLRun?         False   \n",
       "1                   What are some benefits of MLRun?         False   \n",
       "2  I'm on an F-1 visa, how long can I stay in the...          True   \n",
       "3  I'm on an F-1 visa, how long can I stay in the...          True   \n",
       "\n",
       "   metric_success            metric evaluation_model  metric_score  \\\n",
       "0            True  Answer Relevancy  Qwen/Qwen2-0.5B           1.0   \n",
       "1           False      Faithfulness  Qwen/Qwen2-0.5B           0.0   \n",
       "2            True  Answer Relevancy  Qwen/Qwen2-0.5B           1.0   \n",
       "3            True      Faithfulness  Qwen/Qwen2-0.5B           1.0   \n",
       "\n",
       "                                       metric_reason  evaluation_cost  \\\n",
       "0  The score is 1.00 because there is no relevant...              0.0   \n",
       "1  The score is 0.00 because the actual output do...              0.0   \n",
       "2  I'm on an F-1 visa, how long can I stay in the...              NaN   \n",
       "3  The score is <faithfulness_score> because <you...              NaN   \n",
       "\n",
       "   metric_threshold metric_error  \n",
       "0               0.5         None  \n",
       "1               0.5         None  \n",
       "2               0.5         None  \n",
       "3               0.5         None  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "evaluation_run.artifact(\"evaluation\").show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mlrun-base",
   "language": "python",
   "name": "conda-env-mlrun-base-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.22"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
